How successful will your next R package be? A prediction model using R package features

How many times have you found yourself spending long, long hours wrapping up an R package, polishing it and pushing it to CRAN, only to realize afterwards that almost nobody downloads or uses it?
If you are Stephanie Hicks or Roger Peng, then the likely answer is:
Never! People always love it!!!
But think about the hundreds of aspiring R package authors across the world who have had to find out that the number of downloads for their R package is upsettingly low.
Here we come with the How successful will your next R package be? modeling tool, which analyzes your package prototype based on its:

and predicts the number of downloads it will generate over time!
The ultimate goal of the project is to develop a predictive model that takes as input features that can be derived from an “about to be published” package prototype and predicts the number of downloads it will generate over time.
The secondary objective is to identify which features of an R package, derived from elements such as the title and description text, metadata, code file content, attached data content, and vignette content, are associated with a high number of downloads.
Apart from the undeniable need for such a prediction tool, we decided to work on this project because we identified a vast range of methodological challenges in it.
We collected data on R packages (approx. 2,300) from their very first release to CRAN (importantly, we must focus on the package’s first release to make a tool adequate for assessing “about to be published on CRAN” package prototypes). The technical challenges we identified include how to:

- scrape R package metadata, download statistics, and other statistics,
- scrape R packages’ description pages (from CRAN) and process them to extract useful information,
- access R packages’ archive files (from CRAN) to obtain information about the package from its first release version.

A number of problems required brainstorming and decision making at almost every stage of the project, including:
The project work design can be summarized briefly as follows:
Collect data about R packages that were submitted for the FIRST time to CRAN between Nov 1, 2016, and Sep 30, 2017. This is a relatively recent collection of packages.
Perform feature engineering to derive potentially useful prediction model features based on the package’s:
Separate the data into train and test subsets.
On the train subset, train and tune 3 different types of predictive models:
Identify the best modeling approach (based on MSE prediction error on the test subset).
Summarize observations.
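The train/test separation step above can be sketched with the caret package. The object names (`data_x`, `data_y`), the 80/20 split ratio, and the seed are assumptions for illustration, not our exact code:

```r
# Sketch of the train/test split, assuming data_x / data_y are the final
# feature and outcome data frames (hypothetical object names).
library(caret)

set.seed(712)
# Hold out 20% of packages as the test subset, stratified on the outcome
train_idx <- createDataPartition(data_y$download_cnt_365d, p = 0.8, list = FALSE)

x_train <- data_x[train_idx, ];  y_train <- data_y[train_idx, ]
x_test  <- data_x[-train_idx, ]; y_test  <- data_y[-train_idx, ]
```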
In this section, we discuss:
Initially, we wanted to understand the features of highly downloaded packages. We also wanted to build a prediction model based on the features we could collect. Further, we hoped to provide a prediction tool for package authors.
We realized that there are several platforms to which authors can upload their packages, and many potential features we could extract for this project. We discussed choosing CRAN or Bioconductor as the source of R packages, and decided to use CRAN since its authors come from a broader range of backgrounds.
We drew up a long list of potential features and divided the work into several parts, matching the data processing modules described in the following section.
We re-defined our goal: “Build a prediction model for the number of downloads using information from the packages.” Further, we used both the number of downloads in 90 days and the number of downloads in 1 year as our outcomes.
There are many ways to build such a model. We decided to use three methods to train our models: lasso, random forests (RF), and support vector machines (SVM).
In this section, we describe the data source, as well as document the data import, wrangling, etc.
The extracted_features_databook.xlsx file (located at 712-final_project/meta/extracted_features_databook.xlsx) is a data book, that is, a file containing the name and a human-friendly description of each explanatory variable used in modeling.
Below, we describe in more detail how each explanatory variable and the outcome variable were derived.
We aimed to derive package features based on the package’s title and description.
Web scraping:
Using rvest, we extracted the information about the title and package description from the URL.

Feature extraction:
We used tm to help us extract information from the text: (1) we first converted the text data into corpus format; (2) we removed English (en) stop words and the common term R; (3) we used stemDocument to find words with the same stem (for example, we may have calculate, calculation, and calculating in a title or description; with the help of stemDocument, they are all categorized under the same stem, “calcul”). Of note, we debated whether nouns and verbs should be categorized under the same stem, since they may function differently. However, we have not found a package able to handle this problem yet, so we decided to use stemmed words.

The meaning of the derived variables should be clear from the human-readable explanations we provide in the data book (see Final variables description below).
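A minimal sketch (not our exact code) of the scraping and text-preprocessing steps above. The CRAN URL pattern and CSS selectors are assumptions about the current CRAN page layout; the tm pipeline mirrors steps (1)-(3):

```r
library(rvest)
library(tm)

# Scrape a package's title and description from its CRAN page
# (URL pattern and selectors are assumptions, not verified against our scraper)
pkg   <- "rvest"  # illustrative package name
page  <- read_html(paste0("https://CRAN.R-project.org/package=", pkg))
title <- html_text(html_element(page, "h2"))
desc  <- html_text(html_element(page, "p"))

# (1) convert to corpus, (2) drop stop words and the common term "r", (3) stem
corpus <- VCorpus(VectorSource(c(title, desc)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "r"))
corpus <- tm_map(corpus, stemDocument)  # e.g. "calculation" reduces to "calcul"
```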
The R Markdown files in which the title- and description-related features are generated are stored in the 712-final_project/Rmd_files/2018-11-25-feature_task_part1 directory. The final file was saved as 2018-11-25-feature_task_part1_p4.Rmd.
We aimed to derive package features based on the package’s documentation and vignettes.
Our first “old approach” assumed the variables we derive are based on:
Our “new” (final) approach assumed the variables we derive are based on:
- the man directory found in the package source zip corresponding to the first release of the package;
- the vignettes directory found in the package source zip corresponding to the first release of the package. As a vignette, we considered any Rmd/Rnw/md document found in the vignettes folder in the package directory.

The meaning of the derived variables should be clear from the human-readable explanations we provide in the data book (see Final variables description below).
R Markdown files in which the package documentation-related features are generated are stored in 712-final_project/Rmd_files/2018-11-25-feature_task_part3 directory. These include:
- 2018-11-25-feature_task_part3.Rmd - “old” approach code,
- 2018-11-29-feature_task_part3_REDO.Rmd - “new” (final) approach code.

We aimed to derive package features based on the package’s metadata files.
All these features are derived from the DESCRIPTION files present in the package source zip corresponding to the first release of the package. The package source zips corresponding to the first release of each of the approx. 2,300 packages considered were downloaded to the local machine of one of us, so as not to take over the shared Dropbox area. The downloading and processing code remains reproducible.
Dependencies were extracted with the desc_get_deps function from the desc package.

Identifying the authors was challenging, given that not all first-release versions placed the authors and their descriptions in the Authors@R field.
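The dependency extraction mentioned above can be sketched as below; the local path to the unpacked first-release DESCRIPTION file is hypothetical:

```r
# Sketch of dependency extraction with desc::desc_get_deps, assuming the
# DESCRIPTION file from the first-release source zip was unpacked locally
# (the path shown is hypothetical).
library(desc)

deps <- desc_get_deps(file = "pkg_source/DESCRIPTION")
# Returns a data frame with columns: type (Depends/Imports/...), package, version
n_imports <- sum(deps$type == "Imports")
```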
We used the CRANtools utilities from the tools package to obtain information about the current versions on CRAN.

Another challenging feature was the minimum R version required by the package. This information usually appears in the Depends field; however, that was not the case for many packages.
If several R versions were listed, we took the oldest one. If a package did not specify a minimum R version, we tracked the package’s date of release, compared it to the release dates of R versions, and used the R version released closest to the package’s release date.

The meaning of the derived variables should be clear from the human-readable explanations we provide in the data book (see Final variables description below).
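The R-version inference described above can be sketched with the rversions package (an assumption; the original code may have used a different source for R release dates):

```r
# Sketch of inferring a missing minimum R version from release dates.
library(rversions)

r_rel <- r_versions()  # data frame with columns: version, date
pkg_release_date <- as.Date("2017-03-15")  # hypothetical package release date

# Pick the R version whose release date is closest to the package's release
closest <- r_rel$version[which.min(abs(as.Date(r_rel$date) - pkg_release_date))]
```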
R Markdown file with the code used to derive the features is stored at 712-final_project/Rmd_files/2018-11-25-feature_task_part2/2018-11-25-feature_task_part2.rmd.
We aimed to derive package features based on the package’s code and attached data files.
All these features are derived from files present in the package source zip corresponding to the first release of the package. The package source zips corresponding to the first release of each of the approx. 2,300 packages considered were downloaded to the local machine of one of us, so as not to take over the shared Dropbox area. The downloading and processing code remains reproducible.
The meaning of the derived variables should be clear from the human-readable explanations we provide in the data book (see Final variables description below).
R Markdown file with the code used to derive the features is stored at 712-final_project/Rmd_files/2018-12-01-feature_task_from_code.Rmd.
For each package, we derived 2 types of outcomes:
- download_cnt_90d - total number of downloads over 0-90 days after the package’s 1st release to CRAN,
- download_cnt_365d - total number of downloads over 0-365 days after the package’s 1st release to CRAN.

The above statistics were derived with the use of the cranlogs::cran_downloads function (“Daily package downloads from the RStudio CRAN mirror”).
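The download counts can be pulled as sketched below with cranlogs; the package name and date range are illustrative, not taken from our data:

```r
# Sketch of deriving the 90-day download outcome with cranlogs.
library(cranlogs)

# cran_downloads returns one row per day: columns date, count, package
dl <- cran_downloads(packages = "rvest",
                     from = "2017-01-01", to = "2017-03-31")

download_cnt_90d <- sum(dl$count)  # total downloads over the window
```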
R Markdown file with code used to derive the outcome is stored at 712-final_project/Rmd_files/2018-11-27-feature_task_outcome/2018-11-27-feature_task_outcome.Rmd.
The feature deriving and engineering work was divided into modules. The final explanatory variable and outcome variable data sets are stored at the following locations:
- 712-final_project/data/final_dfs/data_x.csv - final explanatory variable data set,
- 712-final_project/data/final_dfs/data_y.csv - final outcome data set.

In total, we collected data for 2,325 R packages.
We have two outcomes that we will model separately:
We have 185 derived variables in total (186 columns minus one column for the package ID) in the data_x.csv final explanatory variable data set.
We started with investigating boxplots of the two outcome variables:
- download_cnt_90d - total number of downloads over 0-90 days after the package’s 1st release to CRAN,
- download_cnt_365d - total number of downloads over 0-365 days after the package’s 1st release to CRAN.

These are presented in the two plots below. The second plot differs from the first in that its Y-axis upper limit is truncated.
We then investigated the outcomes after logarithmic transformation, as shown below.
Given the above, we decided to focus on and limit our analysis to the log-transformed variables, because they exhibit a more reasonable distribution in our sample.
712-final_project/Rmd_files/2018-11-27-feature_task_outcome/2018-11-27-feature_task_outcome.Rmd.

In this section, we answer the questions:
We also:
A glance at the potential features via term clouds obtained from the package titles (LEFT HAND SIDE) and package descriptions (RIGHT HAND SIDE):
Barplot of the most frequent terms obtained from the package titles (LEFT HAND SIDE) and package descriptions (RIGHT HAND SIDE):
33% of the 2,325 packages considered have at least one vignette.
The boxplot below shows log(# of downloads over 1y) among packages, stratified by whether or not a package has a vignette. We can see that the median of the log(# of downloads over 1y) outcome is slightly larger in the group with at least one vignette.
712-final_project/Rmd_files/2018-12-12-EDA_features/2018-12-12-EDA_features.Rmd.

37% of the 2,325 packages were released as a ‘stable’ version (the first digit of the version number was 1 or more).
76% of the 2,325 packages were built using Roxygen. The graph below shows a boxplot of log(# of downloads over 1y) among packages, stratified by whether or not Roxygen was used. We can see that the median of the log(# of downloads over 1y) outcome is slightly larger in the group that used Roxygen.
712-final_project/Rmd_files/2018-11-25-feature_task_part2/2018-11-25-feature_task_part2.rmd and 712-final_project/Rmd_files/2018-12-12-EDA_features/2018-12-12-EDA_features_2_DA.Rmd.

The barplot below shows the percentage of packages (out of the 2,325 packages considered) for which a particular directory was found in their 1st release source files zip:
- demo directory,
- src directory,
- testthat directory.

Interestingly, more than 30% of packages had unit tests implemented via the testthat package.
The plots below show boxplots of log(# of downloads over 1y) among packages, stratified by whether or not a package has a particular directory. We can see that the median of the log(# of downloads over 1y) outcome is noticeably higher in the group with the testthat directory present.
712-final_project/Rmd_files/2018-12-12-EDA_features/2018-12-12-EDA_features.Rmd.

- Use the available, latest CRAN packages to get metadata and other features. We confirmed that most packages have had several new versions since their first release. This makes it difficult to establish a correlation between the current status of the package and the number of downloads at 3 and 12 months. Therefore, we decided to scrape the CRAN archives of each package and collect the first released version.
- Use author data from CRAN and repositories such as PubMed, job information, etc. Given that we use the first-released package versions on CRAN, most of them did not correctly identify the authors using the Authors@R field, which made it very difficult to find the authors of a package. Furthermore, we found that many authors had namesakes in PubMed and other repositories, making their identification, and the consequent extraction of features such as number of papers or current job status, very difficult. Thus, we decided not to include these extra features about package authors. However, we assumed that package authors did not change significantly over time, and included information about authors from the current version of the package.
- Random forest with repeated cross-validation. Random forest estimation using repeated cross-validation (n=10) had already taken more than 24 hours when it was stopped. Therefore, for efficiency, we decided to use plain 10-fold cross-validation for the random forest.
Our team decided that in order to get valid estimates, we should, first and foremost, use the first released package version in the CRAN repository. This ensures that the number of downloads (the outcome of interest) happens due to the initial features of the package; thus, temporality might hold. Also, we selected packages released within a 1-year timeframe to avoid any cohort effects. Similarly, we selected two ‘moments’ for the outcome: the number of downloads after 90 days (approx. 3 months) and after 365 days (approx. 1 year). We use all data available from text mining (title and description), package metadata, and files included in the package to predict the number of downloads. We decided to use three different approaches, linear models, random forests, and support vector machines, to increase the predictive power of the model, minimizing the MSE and maximizing the R2.
In this section, we answer the question:
We used three different approaches to select the best random forest model, using 10-fold cross-validation in all instances to minimize the predicted root mean square error (RMSE).
We considered: (1) the default mtry equal to the square root of the number of variables in the training set; (2) letting R search for the best mtry parameter; and (3) fixing mtry and calculating the performance of random forests with 50, 100, 500, 1000, and 2000 trees. We fit the models to the two selected outcomes:
- download_cnt_90d_LOG - logarithm of the total number of downloads over 0-90 days (approx. 3 months) after the package’s 1st release to CRAN,
- download_cnt_365d_LOG - logarithm of the total number of downloads over 0-365 days (approx. 1 year) after the package’s 1st release to CRAN.

For the download_cnt_90d_LOG outcome, the best tuning results were found in the model tree1000:
- number of trees: 1000,
- mtry parameter: 11,
- RMSE: 0.6467765,
- Rsquared: 0.10986923.

For the download_cnt_365d_LOG outcome, the best tuning results were found in the model auto:
- number of trees: 500,
- mtry parameter: 101,
- RMSE: 0.7625867,
- Rsquared: 0.1572620.

download_cnt_90d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-90 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-90 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.457.
The \(R^2\) on the test set is 0.08.
download_cnt_365d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-365 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-365 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.657.
The \(R^2\) on the test set is 0.188.
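The random forest tuning described above can be sketched with the caret package. This is an assumption about the implementation (the original code may call randomForest directly); the mtry grid shown is illustrative, and x_train/y_train are hypothetical object names:

```r
# Sketch of 10-fold CV tuning of a random forest over an mtry grid.
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
rf_fit <- train(x = x_train, y = y_train$download_cnt_90d_LOG,
                method    = "rf",
                trControl = ctrl,
                tuneGrid  = expand.grid(mtry = c(11, 30, 101)),  # illustrative grid
                ntree     = 1000,
                metric    = "RMSE")

rf_fit$bestTune  # best mtry found by cross-validation
```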
712-final_project/Rmd_files/2018-12-09-modeling_random_forest/2018-12-09-modeling_rf.Rmd.

We have fit two separate linear regression (lasso) models to the two outcomes:
- download_cnt_90d_LOG - logarithm of the total number of downloads over 0-90 days (approx. 3 months) after the package’s 1st release to CRAN,
- download_cnt_365d_LOG - logarithm of the total number of downloads over 0-365 days (approx. 1 year) after the package’s 1st release to CRAN.

In each case, the lambda regularization parameter was chosen via 10-fold, 10-times-repeated cross-validation performed on the training set so as to minimize RMSE.
For the download_cnt_90d_LOG outcome, the tuning results were:
- lambda: 0.02872464,
- RMSE: 0.6627854,
- Rsquared: 0.07133483.

For the download_cnt_365d_LOG outcome, the tuning results were:
- lambda: 0.02472353,
- RMSE: 0.7752640,
- Rsquared: 0.1339603.

download_cnt_90d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-90 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-90 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.474.
The \(R^2\) on the test set is 0.043.
download_cnt_365d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-365 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-365 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.689.
The \(R^2\) on the test set is 0.114.
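The lasso fit with lambda tuned by repeated cross-validation can be sketched via caret's glmnet interface (an assumption about the implementation; the lambda grid and object names are illustrative):

```r
# Sketch of lasso tuning: glmnet with alpha = 1 (pure lasso), lambda chosen
# by 10-fold, 10-times-repeated CV to minimize RMSE.
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
lasso_fit <- train(x = x_train, y = y_train$download_cnt_90d_LOG,
                   method    = "glmnet",
                   trControl = ctrl,
                   tuneGrid  = expand.grid(alpha  = 1,
                                           lambda = 10^seq(-4, 0, length.out = 50)),
                   metric    = "RMSE")

# Coefficients retained at the selected lambda
coef(lasso_fit$finalModel, s = lasso_fit$bestTune$lambda)
```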
712-final_project//Rmd_files/2018-12-09-modeling_linear_model/2018-12-09-modeling_linear_model.Rmd.

We have fit two separate support vector machine (radial kernel) models to the two outcomes:
- download_cnt_90d_LOG - logarithm of the total number of downloads over 0-90 days (approx. 3 months) after the package’s 1st release to CRAN,
- download_cnt_365d_LOG - logarithm of the total number of downloads over 0-365 days (approx. 1 year) after the package’s 1st release to CRAN.

Due to the small variance in the license variables, we excluded all license information using select(-starts_with("license_")) for the SVM model.
We initially tried both a linear-kernel and a radial-kernel support vector machine for both outcomes. However, it took more than 24 hours to tune the parameters of the linear-kernel SVM for one outcome, so we decided to focus on the radial kernel.
In each case, the parameters were chosen via 10-fold, 10-times-repeated cross-validation performed on the training set so as to minimize RMSE.
For the download_cnt_90d_LOG outcome, the tuning results were:
- parameters: sigma = 0.0033 and C = 0.53,
- RMSE: 0.6653641,
- Rsquared: 0.08002368.

For the download_cnt_365d_LOG outcome, the tuning results were:
- parameters: sigma = 0.00414 and C = 4,
- RMSE: 0.7864176,
- Rsquared: 0.1270128.

download_cnt_90d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-90 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-90 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.499.
The \(R^2\) on the test set is 0.074.
download_cnt_365d_LOG outcome on test set

The plot below on the LEFT hand side shows observed versus predicted values of the logarithm of the total number of downloads over 0-365 days on the test set.
The plot below on the RIGHT hand side shows the observed values of the logarithm of the total number of downloads over 0-365 days on the test set versus the residuals. Clearly, there is an association between the residual value and the observed value.
The MSE on the test set is 0.672.
The \(R^2\) on the test set is 0.174.
* The plot below shows the parameter tuning process for the model using downloads over 365 days as the outcome.
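The radial-kernel SVM fit, with license columns dropped as described above, can be sketched via caret's "svmRadial" method from kernlab (an assumption about the implementation; object names and tuneLength are illustrative):

```r
# Sketch of the radial-kernel SVM fit with sigma/C tuned by repeated CV.
library(caret)
library(dplyr)

# Drop the low-variance license variables, as described in the text
x_svm <- select(x_train, -starts_with("license_"))

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
svm_fit <- train(x = x_svm, y = y_train$download_cnt_365d_LOG,
                 method     = "svmRadial",
                 trControl  = ctrl,
                 tuneLength = 10,   # let caret propose a sigma/C grid
                 metric     = "RMSE")

svm_fit$bestTune  # chosen sigma and C
```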
712-final_project//Rmd_files/2018-12-09-modeling_svm/2018-12-14-modeling_svm_90dfinal.Rmd for downloads over 90 days as the outcome, and 712-final_project//Rmd_files/2018-12-09-modeling_svm/2018-12-14-modeling_svm_1yfinal.Rmd for downloads over 365 days as the outcome.

In this section, we answer the questions:
| Outcome | Methods | MSE | Rsquared |
|---|---|---|---|
| log 90 days downloads | Random Forest | 0.457 | 0.080 |
| log 90 days downloads | Linear Regression (lasso) | 0.474 | 0.043 |
| log 90 days downloads | Support Vector Machine | 0.499 | 0.074 |
| log 365 days downloads | Random Forest | 0.657 | 0.188 |
| log 365 days downloads | Linear Regression (lasso) | 0.689 | 0.114 |
| log 365 days downloads | Support Vector Machine | 0.672 | 0.174 |
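The test-set MSE and Rsquared values in the table above can be computed from a fitted model's predictions as sketched below (model and data object names are hypothetical):

```r
# Sketch of computing test-set MSE and R-squared for a fitted model.
pred <- predict(rf_fit, newdata = x_test)   # rf_fit: a fitted model object
obs  <- y_test$download_cnt_365d_LOG        # observed log download counts

mse <- mean((obs - pred)^2)                 # mean squared error
r2  <- cor(obs, pred)^2                     # squared correlation, as in caret
```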
Random forest and support vector machine achieve the lowest MSE on our test data set for the log of the number of downloads in one year. However, the support vector machine cannot provide explicit information about which features are important to the prediction model.
The top features included: title (ggplot), authors (number), unit testing (files), description ("interface"), and data files. R package developers should include the top features highlighted in our results to potentially increase the ‘success’ of their products.
Feature extractions
A package has many features, and we were not able to include all of them. For example, the field a package is designed for is not taken into account in our analysis, and this feature may also play a role in the number of downloads. Another example: some packages have video tutorials or have been distributed to users through conference workshops.
Outcome definition
Indeed, since there are many repositories through which a package can be distributed, we were not able to check the downloads across all of them ourselves. This may lead to a biased measurement of the number of downloads, both over 90 days and over 1 year.
Modeling approach
Though we tried several models and took the log of the outcome due to non-normality, we can still see a positive association between the residuals and the observed values in all three models we applied. This may be due to model misspecification or to other unmeasured important predictors.
Generalizability
There are other platforms through which R users can distribute their packages, such as Bioconductor. Since these platforms have different rules and restrictions on package deployment, the quality of the packages and the target users may vary from one platform to another. We used the R packages stored on CRAN in this project, so inference about other platforms using this model should be made with caution.
From this project, we learned that it is very important to have a well-defined question. In the beginning, we tried to build a prediction tool on top of the model we were developing. However, it took us some time to reach consensus on several issues:
Along the way, we learned how to use existing tools to facilitate our project, such as parallel computing, text mining, and extracting download counts. During the process, we found that time is an important constraint for a project like this, and the time we spent on tuning parameters was much longer than expected. At the same time, we realized that while it is pretty cool to have a Shiny app, in reality R package authors might benefit more from knowing the features of highly downloaded packages in advance, instead of submitting their package to the Shiny app and seeing the model's prediction only after their package is almost finished. Last but not least, this was a very joyful journey for all of us: we learned more about R, experienced the reality of data science (a topic with time constraints), and collaborated with each other.